Replace cfn-hup on compute nodes with systemd timers to signal updates #3070

hgreebe · 2025-12-17T13:28:36Z

Description of changes

Changes how compute nodes handle in-place updates so that they no longer rely on cfn-hup running on compute nodes.
Replaces cfn-hup with a systemd timer that periodically checks a file in shared storage and runs an update when that file contains a new cluster config version that has not yet been applied to the compute node. The head node updates this file whenever a cluster update occurs, which signals the compute nodes to update.
The check-update.service has a 30-second timeout so that it does not run indefinitely if something hangs.
Revert changes for in_place_update_on_fleet_enabled from: 6eda378#diff-6d6c58cce2dd575c0638ee245d9647b0dfa3cbdef86a136bd816d00538529fb4
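
As a rough illustration of the check described above (not the script shipped in this PR), the decision could be sketched as follows. The function name, the file paths passed in, and the assumption that each file holds a single config version string are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decide whether a compute node must apply an update.
# Assumption: the trigger file and the checkpoint file each hold one
# cluster config version string.

# Prints "update" when the trigger file holds a version that differs from
# the checkpoint, and "skip" otherwise.
decide_update() {
  local trigger_file="$1" checkpoint_file="$2"
  local wanted="" applied=""

  # No trigger file means the head node has not signalled any update yet.
  if [ ! -f "$trigger_file" ]; then
    echo "skip"
    return 0
  fi
  wanted="$(cat "$trigger_file")"

  # A missing checkpoint is treated as "nothing applied yet".
  if [ -f "$checkpoint_file" ]; then
    applied="$(cat "$checkpoint_file")"
  fi

  if [ -n "$wanted" ] && [ "$wanted" != "$applied" ]; then
    echo "update"
  else
    echo "skip"
  fi
}
```

Comparing file contents rather than modification times keeps the decision independent of clock differences between the head node and the compute nodes; whether the PR does the same is not stated here.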

Tests

Created unit tests to cover the systemd timer and service as well as new update logic
Ran all the update integ tests

References

PR to remove validator test for in_place_update_true chef attribute: Remove in_place_update_true chef attribute from tests aws-parallelcluster#7156

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

gmarciani · 2025-12-22T18:32:47Z

cookbooks/aws-parallelcluster-computefleet/files/check_update/check-update.timer

@@ -0,0 +1,11 @@
+[Unit]
+Description=Check file modification time every minute


The description must be more specific: it should make clear that this timer is configured by pcluster (add this information to both the description and the file name) and state the goal of the timer (which files it checks for modifications, and why).

gmarciani · 2025-12-22T18:45:16Z

cookbooks/aws-parallelcluster-computefleet/files/check_update/check-update.timer

+Description=Check file modification time every minute
+
+[Timer]
+AccuracySec=1s


The accuracy setting has an impact on CPU wake-ups.
The finer the accuracy, the more likely the timer is to generate load on the CPU.
Since we are using a 60-second activation interval, what about using a 20s accuracy?

See https://www.freedesktop.org/software/systemd/man/latest/systemd.timer.html

gmarciani · 2025-12-22T18:47:18Z

...books/aws-parallelcluster-computefleet/recipes/config/config_check_update_systemd_service.rb

+# Cookbook:: aws-parallelcluster-slurm
+# Recipe:: config_compute
+#
+# Copyright:: 2013-2021 Amazon.com, Inc. or its affiliates. All Rights Reserved.


Here and in other files: fix the copyright year, using Copyright:: 2025 Amazon.com

gmarciani · 2025-12-22T18:47:31Z

...books/aws-parallelcluster-computefleet/recipes/config/config_check_update_systemd_service.rb

+
+#
+# Cookbook:: aws-parallelcluster-slurm
+# Recipe:: config_compute


Fix recipe name

gmarciani · 2025-12-22T19:35:40Z

...books/aws-parallelcluster-computefleet/recipes/config/config_check_update_systemd_service.rb

+# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
+# limitations under the License.
+
+template '/etc/systemd/system/check-update.service' do


Here, above and below: the names of the service and the corresponding timer must be more descriptive, making clear that they are pcluster services related to cluster updates.

gmarciani · 2025-12-22T20:10:41Z

CHANGELOG.md

 ------

+**CHANGES**
+- Replace cfn-hup in compute nodes with systemd timers to support in place updates.


We must surface the value of this change for the user, which is better performance.
We should also surface that the new mechanism relies on shared storage sync between the head node and the compute fleet.

gmarciani · 2025-12-22T20:13:35Z

cookbooks/aws-parallelcluster-slurm/recipes/update/update_head_node.rb

 chef_sleep '15'

-wait_cluster_ready if cluster_readiness_check_on_update_enabled?
+wait_cluster_ready


Why remove this?
We should keep it, as it gives the user a way to skip the readiness check if necessary.

gmarciani · 2025-12-22T20:14:24Z

cookbooks/aws-parallelcluster-shared/libraries/helpers.rb

-  node['cluster']['node_type'] == 'HeadNode' || node['cluster']['in_place_update_on_fleet_enabled'] == 'true'
-end
-
-def cluster_readiness_check_on_update_enabled?


Same as https://github.com/aws/aws-parallelcluster-cookbook/pull/3070/changes#r2641117741

I think it is useful to keep a way to skip the cluster readiness check through chef attributes.
The current chef attribute name must be changed, as in_place_update_on_fleet_enabled will no longer be valid. We could expose something like cluster/update/cluster_readiness_check_enabled

gmarciani · 2025-12-22T20:17:18Z

cookbooks/aws-parallelcluster-shared/attributes/cluster.rb

 default['cluster']['nfs']['hard_mount_options'] = 'hard,_netdev,noatime'
-
-# Cluster Updates
-default['cluster']['in_place_update_on_fleet_enabled'] = 'true'


See https://github.com/aws/aws-parallelcluster-cookbook/pull/3070/changes#r2641119272

gmarciani · 2025-12-22T20:20:43Z

cookbooks/aws-parallelcluster-shared/attributes/cluster.rb

+default['cluster']['shared_update_path'] = "#{node['cluster']['shared_dir']}/check_update"
+default['cluster']['update_checkpoint'] = "#{node['cluster']['scripts_dir']}/update_checkpoint"


I suggest using more descriptive names, such as

default['cluster']['update']['trigger_file'] = "#{node['cluster']['shared_dir']}/update_trigger"
default['cluster']['update']['checkpoint_file'] = "#{node['cluster']['scripts_dir']}/update_checkpoint"

With such names for the attributes we better convey that:

they are files (when an attribute contains a directory path, it has dir as suffix)

one is used as a trigger

gmarciani · 2025-12-22T20:25:25Z

cookbooks/aws-parallelcluster-computefleet/templates/check_update/check-update.service.erb

+fi'
+
+[Install]
+WantedBy=multi-user.target


This service is triggered by a timer. What is the point of having WantedBy=multi-user.target?
I think it could be removed as it is actually unused.

gmarciani · 2025-12-22T20:33:52Z

cookbooks/aws-parallelcluster-computefleet/templates/check_update/check-update.service.erb

+[Unit]
+Description=Check for recent file modifications
+
+[Service]


We are missing logs for the logic executed by this service.
I suggest setting the StandardOutput property to log to a local log file. We must then push that file to CloudWatch.

Also, the logic should log more information:

when it starts

if the file exists

what is the content of the checkpoint file

what is the content of the trigger file

the decision taken

when it completes

All log lines must be timestamped with millisecond resolution.
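
To make the logging request concrete, here is a minimal sketch of a check routine that emits each of the lines listed above with a millisecond timestamp. The function names and message wording are hypothetical, and the `%3N` format is a GNU date extension that is assumed to be available on the compute nodes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the requested logging; names and wording are illustrative.

# Print one line prefixed with a UTC timestamp at millisecond resolution.
# %3N (milliseconds) is a GNU date extension.
log() {
  printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%S.%3NZ')" "$*"
}

# Log start, trigger file existence and content, checkpoint content,
# the decision taken, and completion.
logged_check() {
  local trigger_file="$1" checkpoint_file="$2"
  local wanted="" applied="" decision="skip"

  log "check started"
  if [ -f "$trigger_file" ]; then
    wanted="$(cat "$trigger_file")"
    log "trigger file $trigger_file exists, content: '$wanted'"
  else
    log "trigger file $trigger_file does not exist"
  fi
  if [ -f "$checkpoint_file" ]; then
    applied="$(cat "$checkpoint_file")"
  fi
  log "checkpoint file $checkpoint_file content: '$applied'"
  if [ -n "$wanted" ] && [ "$wanted" != "$applied" ]; then
    decision="update"
  fi
  log "decision: $decision"
  log "check completed"
}
```

With the output written to stdout, the service's StandardOutput setting decides where the lines end up, so the routine itself does not need to know the log file path.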

gmarciani · 2025-12-22T21:20:18Z

cookbooks/aws-parallelcluster-computefleet/templates/check_update/check-update.service.erb

+
+[Service]
+Type=oneshot
+TimeoutStartSec=30


I suppose this timeout is here to prevent the service being stuck in case of file system unresponsiveness.
This is a good strategy as it is important to put a timeout when dealing with remote resources.
However, I think TimeoutStartSec may not be the right parameter to achieve this.

According to the official documentation for TimeoutStartSec:

If a daemon service does not signal start-up completion within the configured time, the service will be considered failed and will be shut down again.

So the timeout covers the whole start logic. The start logic here includes the execution of cfn-hup-update-action.sh, which can legitimately last longer than 30 seconds. So with the current approach we have the risk of interrupting legitimate updates.

If you agree that this is a risk, I suggest configuring the timeout logic differently:

define a timeout of 20 seconds to read the shared files (20s is enough to capture file system failures with a simple read operation on a single file)

define a timeout for the update action, which must be set through the node_bootstrap_timeout parameter, as it currently is. See https://github.com/gmarciani/aws-parallelcluster-cookbook/blob/wip/mgiacomo/3141/fix-build-ubhu-1218-1/cookbooks/aws-parallelcluster-environment/templates/cfn_hup_configuration/cfn-hook-update.conf.erb#L9-L9
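
A minimal sketch of the two separate timeouts, assuming the GNU coreutils `timeout` command is available. The function names and the `UPDATE_TIMEOUT` variable are hypothetical; `UPDATE_TIMEOUT` stands in for the value of node_bootstrap_timeout, and its default below is only a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: one short timeout for the shared file read,
# one longer timeout for the update action.

# Read a shared file with a bounded wait (20 seconds unless overridden).
# `timeout` exits with status 124 when the limit is reached.
read_shared_file() {
  local file="$1" limit="${2:-20}"
  timeout "$limit" cat "$file"
}

# Run the update action under its own limit.
# UPDATE_TIMEOUT is an assumed stand-in for node_bootstrap_timeout.
run_update_action() {
  local limit="${UPDATE_TIMEOUT:-1800}"
  timeout "$limit" "$@"
}
```

For example, `run_update_action /path/to/update-action.sh` would bound only the update itself, while `read_shared_file` bounds each read of the shared files. One caveat: a process blocked in uninterruptible I/O on an unresponsive mount may not react to the signal that `timeout` sends, so this should be validated against the actual mount options.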

hgreebe added 2 commits December 16, 2025 13:00

Use systemd timers to trigger updates

61ff8e4

Add unit test to cover systemd timer logic for updates

d7a8aea

hgreebe requested review from a team as code owners December 17, 2025 13:28

Update CHANGELOG

0c05f11

hgreebe force-pushed the develop branch from bff72e7 to 0c05f11 Compare December 17, 2025 14:38

Remove in_place_update_on_fleet_enabled chef attribute

e568ae2

hgreebe force-pushed the develop branch 2 times, most recently from 5ce2f72 to 96d1cbd Compare December 17, 2025 15:38

Fix failing unit tests

329c17c

hgreebe force-pushed the develop branch from 96d1cbd to 329c17c Compare December 17, 2025 15:41

hgreebe mentioned this pull request Dec 17, 2025

Remove in_place_update_true chef attribute from tests aws/aws-parallelcluster#7156

Open

gmarciani reviewed Dec 22, 2025
